Task Fusion #113
base: branch-24.03
Conversation
Object cycle fix
Fix for Issue 77
legate/core/runtime.py
Outdated
#self.validIDs.add(5) #Convert op
#self.validIDs.add(7) #dot op
#self.terminals.add(7) #dot op is only fusable as a terminal op in a window
I initially introduced the notion of a 'terminal' op that is fusable only at the end of a window. I don't use this so will remove it.
legate/core/runtime.py
Outdated
intervals.append((start, end))
#print("proj", intervals)
return True, intervals
As stated, the above constraint isn't particularly cheap, and it doesn't seem to ever be "used". I will look into the perf, but I suspect the get_projection method is the culprit here.
legate/core/runtime.py
Outdated
fusion_checker.register_constraint(IdenticalProjection())
fusion_checker.register_constraint(ValidProducerConsumer())
can_fuse, fusable_sets, partitions = fusion_checker.can_fuse()
fusion constraints are registered above
print("launch unfused input", op._task_id, input)
self.propogateFuture(op)
op.launch(op.strategy)
the above func will be cleaned up and simplified in the next day or so
if store._key_partition is not None:
    partition = store._key_partition
else:
    partition = store.compute_key_partition(restrictions)
If the store already has a key partition (set by the fuser), no need to compute it
There may be a more elegant way of doing this
You should check whether the key partition satisfies the restrictions before you reuse it, and I think this logic should be moved into compute_key_partition. Just so you know, I'm currently refactoring the code that manages Legate stores, so don't make micro-optimizations like this, as they will become obsolete in the near future.
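A minimal sketch of the suggested shape, folding the reuse into compute_key_partition; the satisfies() check and the _compute_key_partition fallback are hypothetical stand-ins, not the real Store API:

def compute_key_partition(store, restrictions):
    # Reuse the cached key partition (e.g. one set by the fuser) only if it
    # still satisfies the restrictions; otherwise recompute it.
    cached = store._key_partition
    if cached is not None and cached.satisfies(restrictions):  # assumed method
        return cached
    return store._compute_key_partition(restrictions)  # hypothetical fallback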
src/core/data/store.cc
Outdated
fprintf(stderr, "Found an uninitialized Legate store\n");
assert(false);
//fprintf(stderr, "Found an uninitialized Legate store\n");
//assert(false);
I will come back to this; it was done to make the constant optimization work. I will check whether I indeed need this assertion removed or modified.
@@ -794,8 +794,8 @@ def driver():
)
all changes to this file will be reverted before merge
name="legate.core", | ||
version="0.1", | ||
name="legate-core", | ||
version="21.10.00", |
Is the current version indeed 0.1? If so I'll undo the above change
def build_task(self, launch_domain, argbuf):
    self._req_analyzer.analyze_requirements()
    self._out_analyzer.analyze_requirements()

    # pack fusion metadata
    self.pack_fusion_metadata(argbuf, self._is_fused, self._fusion_metadata)
I'm not a fan of this design. The fusion task shouldn't be any different from a normal Legate task. I believe the fusion metadata can be passed as a bunch of scalar arguments to the fusion task. If we do that, then we don't need to make every place in the core handle fusion. Making the default path aware of task fusion doesn't sound like a good design.
This is definitely worth discussing (I pondered this myself). I did think about making them scalars, but was not sure whether there were other implications/downsides of storing all the metadata that way (e.g. is there a maximum number of scalars we can give a task?), since we'd be sending quite a few scalars to the task (or rather multiple lists, each of which can have >50 scalars; the number of scalars in 4 of the 5 lists is equivalent to the fused_op length, even with potential deduplication of stores).
If we did this, we'd be mixing "metadata" scalars (which are really a bunch of lists) into the task's scalars array alongside the actual scalars used by the sub-tasks, which seemed not to really adhere to the abstraction of having a dedicated scalars array in the context data structure. I thus chose to serialize it as Legate task argument data. As you stated, the downside is that the core/default path has to be "fusion aware".
But if you don't see any potential downsides of making them all scalars, then yeah, it could totally be implemented as scalar-based. We would mostly just have to move the book-keeping logic from the deserializer into the Fused Legate Task.
I feel what's actually missing here is the ability to pass an arbitrary chunk of data (probably a memoryview object in Python) to a task, which is in theory possible, so we should probably extend the buffer builder interface to take an arbitrary memory view object and pack it as a single contiguous chunk. The assumption here is that the task knows how to interpret those bytes. It shouldn't be too difficult for me to add it.
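A minimal sketch of what that could look like on the Python side, assuming a flat layout of little-endian 32-bit ints and a hypothetical pack_buffer hook on the buffer builder (neither is an existing Legate API):

import struct

def pack_fusion_metadata(input_starts, output_starts, op_ids):
    # Layout: [num_ops][input_starts...][output_starts...][op_ids...], all
    # little-endian 32-bit ints; the fused task must unpack with the same
    # layout, e.g. struct.unpack_from("<i", buf, 0) for the op count.
    ints = [len(op_ids), *input_starts, *output_starts, *op_ids]
    return memoryview(struct.pack("<%di" % len(ints), *ints))

# e.g. argbuf.pack_buffer(pack_fusion_metadata(...))  # hypothetical method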
TaskContext(const Legion::Task* task,
            const std::vector<Legion::PhysicalRegion>& regions,
            Legion::Context context,
            Legion::Runtime* runtime);

TaskContext(const Legion::Task* task, const std::vector<Legion::PhysicalRegion> regions)
Why is this constructor necessary?
Each sub-op is called like this (via function pointer):

for (int i = 0; i < fused_op_len; i++) {
  // look up sub-op i's registered GPU variant and invoke it on the context
  auto descriptor = Core::gpuDescriptors.find(opIDs[i])->second;
  descriptor(context);
}
However, the context above is not the same as the parent task's context. Whereas the parent task's context contains the "union" of all inputs, outputs, reductions, and scalars over all sub-ops, the above context should only include the inputs, outputs, etc. for sub-operation/task i. I used this constructor to build the operation-specific contexts, which hold each sub-op's specific set of inputs, outputs, etc.
This brings up a potential issue: the new context's task and regions fields are currently not specific to the sub-op. I'm not sure how much of an issue that is, as I only need the context structure to hold stores and scalars for the sub-ops. Ideally I wanted to avoid any performance cost of creating new Task objects in the Fused Task, so I elected to keep the new context object as such.
You do need a new context in this case, but my question is more about why you couldn't use the existing constructor. The constructor you added leaves the runtime and context members uninitialized, leading to a crash if the child task tries to access the runtime (though no task in cuNumeric is actually doing that). I think the context you are creating here should still have a valid runtime and context pointer.
That said, I think you actually need a slightly different change here, one which allows TaskContext to override the Legion task's id. The task object here is that of the fusion task, so if you pass it as-is to a child task, the child task may be misled about its own identity. I believe no cuNumeric task is actually doing this, but leaving it unaddressed can be a foot gun later on.
Yeah, it's a valid point. I initially tried using the normal constructor and got runtime errors from the task deserializer. Recreating the issue, I get the following (the error is from span.h: Assertion pos < size_ failed.):
legion_python: /home/shiv1/fmultiGPU/legate.core/src/core/utilities/span.h:38: decltype(auto) legate::Span<T>::operator[](size_t) [with T = const Legion::PhysicalRegion; size_t = long unsigned int]: Assertion `pos < size_' failed.
Signal 6 received by node 0, process 3131805 (thread 7f8d47bf3000) - obtaining backtrace
Signal 6 received by process 3131805 (thread 7f8d47bf3000) at: stack trace: 19 frames
[0] = /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f8d70ce33c0]
[1] = /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f8d707f718b]
[2] = /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f8d707d6859]
[3] = /lib/x86_64-linux-gnu/libc.so.6(+0x25729) [0x7f8d707d6729]
[4] = /lib/x86_64-linux-gnu/libc.so.6(+0x36f36) [0x7f8d707e7f36]
[5] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(+0x5a95d) [0x7f8d4428595d]
[6] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskDeserializer::_unpack(legate::Store&)+0x10e) [0x7f8d44286d9e]
[7] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskContext::TaskContext(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)+0x7f6) [0x7f8d44278626]
[8] = /home/shiv1/fmultiGPU/legate.core/install/lib/libcunumeric.so(cunumeric::FusedOpTask::gpu_variant(legate::TaskContext&)+0x6e2) [0x7f8d1ce0a852]
At the time I didn't spend much time debugging this and just wanted something that worked for the time being (the constructor I made didn't call the deserializer; I didn't want to reserialize/deserialize things anyway, because it involves additional overhead). I'm guessing it has to do with the number of futures or scalars passed in, but I can totally dive a little deeper into what's causing this.
I suspect the crash is due to the fusion metadata you added that the current serializer doesn't understand, so this issue is tied to the fusion awareness introduced in the code base, which I suggest removing. Actually, we shouldn't be deserializing the task argument for the child task context, so you're right that there should be a constructor that doesn't deserialize the argument. I guess we could add an argument to the existing constructor to control the behavior.
@@ -121,13 +130,14 @@ class TaskContext {
 public:
  ReturnValues pack_return_values() const;

 private:
 public:
Why did you want to expose the private members here?
Ahh, this should be reverted. It's an artifact from when we were using the inline launcher and I needed access to these members.
input_start, output_start, offset_start = 0, 0, 0
reduction_start, scalar_start, future_start = 0, 0, 0

for op in ops:
I guess we'd eventually want coalescing of arguments here as well. Right now, the fusion code passes the same store used by multiple tasks multiple times, hoping that the launcher at least coalesces region requirements for those duplicates. But, we can identify that the same store is being used in multiple tasks when we scan them for dataflow analysis. So, we could potentially pass a deduplicated list of stores to the fusion task with a mapping between the original indices and those in the store list that is passed to the fusion task.
Yeah, we could deduplicate these, and I don't imagine this being too difficult. We would just have to place the offset of the deduplicated store (eg in line 1145) in the offsets array, and it should just work. A sketch of the idea follows below.
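A minimal sketch of that deduplication, keying stores by identity (the names are illustrative, not the PR's):

def deduplicate_stores(all_store_uses):
    unique_stores = []  # deduplicated store list passed to the fused task
    index_of = {}       # id(store) -> position in unique_stores
    offsets = []        # per original argument slot: index into unique_stores
    for store in all_store_uses:
        key = id(store)
        if key not in index_of:
            index_of[key] = len(unique_stores)
            unique_stores.append(store)
        offsets.append(index_of[key])
    return unique_stores, offsets

# e.g. for uses [s1, s2, s1] this yields ([s1, s2], offsets [0, 1, 0])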
Thanks a lot for all your hard work on this PR. I've left several comments and questions, but you don't necessarily need to address all of them. I'll pick this up probably after the break and we can work together to make this ready for the merge.
ty.uint32,
)
ty.uint32,)
self._window_size = 50
this shouldn't be hardcoded, will remove
@shivsundram @magnatelee What is the status of this PR?
@marcinz I'm planning to port/rebase this once the planned refactoring on the core is done.
PR for Task Fusion
Task Fusion PR guide
This PR implements:
Task Fusion
i. Fusibility determination and creation of the fused ops, done in the Python code (see /core/runtime.py)
ii. Issuing of a fused task's sub-operations, currently done in cuNumeric C++ code. This can be viewed in the companion PR: Task Fusion, Constant Conversion Optimization, and 27pt stencil benchmark cupynumeric#150
iii. An interface for accessing each registered Legate task's function pointer, used for efficiently launching a fused task's sub-tasks
Constant Conversion Optimization
i. Eliminates issuing of expensive convert tasks for static/constant scalars (ie there's no need to defer execution of this type conversion when the underlying Future's value is a constant embedded in the code)
ii. This lives in cuNumeric, but the changes can be viewed here (will be contributed in a separate PR)
Benchmark changes
i. Re-organization of the synchronizations in black_scholes. Instead of checking that calls and puts != NaN at the end of every benchmarking run, we check this only after all benchmarking runs have concluded.
ii. A new 27-pt stencil benchmark, demonstrating how fusion boosts perf even though it theoretically decreases task-level parallelism.
iii. Lives in cuNumeric, but can be viewed in the companion PR here.
Required for merge (ie TODO):
- Test OpenMP support
Note
I am currently unable to move the Fused Task to the core, as doing this causes issues with the fused task's mapper tag. Until then, the Fused Task will live in cuNumeric, in the companion PR nv-legate/cupynumeric#150.
Overview of Optimizations
Task Fusion
Given a window of buffered Legate tasks, fusion finds sub-windows of operations that can each be aggregated into a single launch/fused operation. In other words, given a 100% fusable window of N tasks, the runtime would normally issue O(N) meta/utility tasks (mapping calls, scheduling, etc.) for this window, whereas the fused operation only requires O(1) such meta-tasks, thus reducing overhead. There are currently five constraints dictating op fusability (the fifth being temporary):
Fusability Constraints
1. Fusable Operations: Only a certain subset of Legate tasks is fusable (eg unary, binary, ternary ops).
2. Producer-Consumer: Within the same fused op, one cannot write to one store representing an underlying buffer/regionfield and then read from a different store representing the same underlying buffer/regionfield. This prevents the existence of invalid combinations of aliased partitions in a fused operation (eg the reads, adds, and mults in an in-place stencil are fusable, but not the write/update of the buffer at the end of the stencil iteration).
3. Launch Shape Equivalence: All sub-operations in a fused task must have the same launch shape.
4. Projection Functor Equivalence: If a store is used by multiple sub-ops in a fused task, every sub-op's functor for indexing into that store must be equivalent.
5. (Temporary) Existence of cuNumeric: Until the FusedTask is moved to the core, we can only fuse if the cuNumeric context exists (note that the Fused Task being in cuNumeric does not preclude it from fusing non-cuNumeric tasks).
All constraints live in core/runtime. Each constraint's apply function is used to apply the constraint to a window of ops; a sketch of this interface follows below. By Dec 8, this will be replaced with a new set of functions that apply constraints more efficiently (most of these are already in the PR and are currently called apply2, but will eventually just be called apply).
Currently, Constraint 4 is the most expensive to run, and it doesn't seem to ever trigger. If we can come up with a better way of testing it, runtime will further improve.
Constant Optimization
Currently implemented in cunumeric/array.py, when issuing binary_ops. Instead of creating a convert operation to convert a known scalar constant (eg from fp64 to fp32), we simply create the scalar in place and eliminate the need for said operation. Not only does this enable larger levels of fusion (as convert is not a fusable op), but the optimization in itself provides large performance boosts. Note this only triggers when the constant is a DeferredArray (not Eager, as this optimization actually slows down ops using EagerArrays).
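A hedged sketch of the shortcut (not cuNumeric's actual code): when an operand of a binary op is a plain Python constant, cast it eagerly with NumPy instead of issuing a deferred convert task.

import numpy as np

def coerce_constant(value, target_dtype):
    # eg a float64 literal feeding an fp32 binary op: cast it now, so no
    # (non-fusable) convert task is ever created for it
    return np.asarray(value).astype(target_dtype).item()

print(coerce_constant(2.0, np.float32))  # 2.0, already in the op's dtype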
Performance:
Depending on input sizes, GPU counts (1-16 currently tested), and GPU type (Pascal, Volta currently tested), we see:
~1.6-2x speedup for black scholes
~1.6-2x speedup for 27pt stencil
~1.15-2x speedup for logistic regression
~1.15x speedup for Jacobi iteration
No statistically significant speedup for conjugate gradient (between 1.0x-1.05x)
The maximum window_size in both the fused and unfused baseline (ie the nv-legate reference) is set to 50 in these tests. Speedup is even higher when the baseline uses a window_size of 1 (the current setting in the reference) while the fused code still uses n=50 (the current setting in this PR).
Testing
Tests are currently running (I have been testing as I go). I have not seen a single instance where the fused and unfused versions generate different outputs.
Known Issues: